graph: introduce internal dnnl_sdpa op #2930
Conversation
Force-pushed from a2f5367 to 6145d95
What were the main optimizations that led to the 2x performance gain?
First, the compilation performance bottleneck is in the layout_propagation and memory_planning passes. Layout propagation tries to create a primitive descriptor for each op in order to query its optimal layout, and memory planning walks the whole graph to plan the input/output/internal memory sizes for the subgraph. Both steps end up creating a large number of memory descriptors.
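For context on where that per-op cost comes from: the sketch below is a standalone illustration against the public oneDNN C++ API, with made-up shapes, not the internal pass code. It shows the kind of primitive-descriptor query that layout propagation issues for a single matmul; fewer ops in the compiled subgraph presumably means fewer such queries and fewer memory descriptors for memory planning, which is where the compile-time saving of a single fused dnnl_sdpa op would come from.

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Example shapes for the Q x K^T matmul of one attention head
    // (batch, heads, seq, head_size) -- illustrative values only.
    memory::dims src_dims = {1, 16, 384, 64};
    memory::dims wei_dims = {1, 16, 64, 384};
    memory::dims dst_dims = {1, 16, 384, 384};

    // format_tag::any asks the library to pick the layout it prefers.
    auto src_md = memory::desc(src_dims, memory::data_type::f32, memory::format_tag::any);
    auto wei_md = memory::desc(wei_dims, memory::data_type::f32, memory::format_tag::any);
    auto dst_md = memory::desc(dst_dims, memory::data_type::f32, memory::format_tag::any);

    // Creating the primitive descriptor is the costly query: the backend
    // selects a kernel and, with it, the optimal layouts for src/weights/dst.
    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md);

    // The chosen layouts are then read back and propagated to neighboring ops.
    memory::desc chosen_src = pd.src_desc();
    memory::desc chosen_dst = pd.dst_desc();
    (void)chosen_src;
    (void)chosen_dst;
    return 0;
}
```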
make test
Force-pushed from 4560e25 to e0ded0f
make test
@xiang1guo please fix the clang-tidy warnings: clang-tidy warnings
Force-pushed from e0ded0f to 71905d2
Thanks for the reminder. I fixed the warnings and reorganized the commits, please check again. The remaining clang-tidy warnings seem unrelated to the changes in this PR.
make test
Description
Compiled graph before this PR
Compiled graph after this PR
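For reference, SDPA here is scaled dot-product attention, i.e. softmax(Q·K^T·scale + mask)·V. Below is a minimal single-head, plain-C++ reference sketch of that computation (hypothetical shapes and names, not oneDNN code), just to make explicit what the single internal dnnl_sdpa op stands for in the compiled graph, compared with the chain of matmul/scale/mask/softmax ops that typically forms the SDPA pattern.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Single-head reference: Q, K, V are [S x D] row-major, mask is [S x S].
// output = softmax(Q * K^T * scale + mask) * V
std::vector<float> sdpa_ref(const std::vector<float> &Q,
                            const std::vector<float> &K,
                            const std::vector<float> &V,
                            const std::vector<float> &mask,
                            int S, int D, float scale) {
    std::vector<float> score(S * S), out(S * D, 0.f);
    // score = Q * K^T * scale + mask
    for (int i = 0; i < S; ++i)
        for (int j = 0; j < S; ++j) {
            float acc = 0.f;
            for (int d = 0; d < D; ++d)
                acc += Q[i * D + d] * K[j * D + d];
            score[i * S + j] = acc * scale + mask[i * S + j];
        }
    // Row-wise softmax (max-subtracted for numerical stability).
    for (int i = 0; i < S; ++i) {
        float mx = score[i * S];
        for (int j = 1; j < S; ++j) mx = std::max(mx, score[i * S + j]);
        float sum = 0.f;
        for (int j = 0; j < S; ++j) {
            score[i * S + j] = std::exp(score[i * S + j] - mx);
            sum += score[i * S + j];
        }
        for (int j = 0; j < S; ++j) score[i * S + j] /= sum;
    }
    // out = attention_weights * V
    for (int i = 0; i < S; ++i)
        for (int j = 0; j < S; ++j)
            for (int d = 0; d < D; ++d)
                out[i * D + d] += score[i * S + j] * V[j * D + d];
    return out;
}
```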
Works
Follow-up
There will be another PR to refine the GQA pattern based on this new internal dnnl_sdpa op.
Validation
Correctness check
There are 66 cases in total that can be supported by the GPU ukernel. The 42 floating-point SDPA cases now run into sdp_primitive_v1_kernel_t, while the other 24 cases are quantized SDPA and all run into sdp_primitive_kernel_t.
Performance test